{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我是参加DataCastle猫狗大战的选手，kuhung。在测评中，我提交的数据集最后评分0.98639。以下是我的备战过程及心得体会。（最后有完整代码及较全面的注释）\n",
    "\n",
    "## 个人介绍\n",
    "华中科技大学机械学院的大二（准大三）学生，接触数据挖掘快有一年了。早期在学生团队做过一些D3数据可视化方面的工作，今年上半年开始数据挖掘实践。想把这个爱好发展成事业。做过阿里的天池竞赛，也有在kaggle混迹。算个数据新手，但一直不承认：你是新人，所以成绩不好看没啥关系。 \n",
    "\n",
    "## 初识比赛\n",
    "第一次接触数据集，就感觉有些难度。因为以前没做过图片分类的比赛，更没想过要用深度学习的神经网络进行识别。思索一番，还是觉得特征提取后，使用决策树靠谱。自己也下去找过资料，发现并不容易实现。期间，还曾一度想过肉眼识别。但打开文件，看到那1400+图片，就觉得这时间花在肉眼识别上不值。中间一度消停。\n",
    "\n",
    "## 初见曙光——yinjh战队分享\n",
    "后来上论坛逛过几次。一次偶然的机会，让我看到了yinjh团队分享的vgg16模型。乍一看，代码简单、效果不错。更为重要的是，这个模型自己以前从未见过。于是抱着验证学习的态度，我把代码扣了下来，打算自己照着做一遍。\n",
    "\n",
    "## 过程艰难\n",
    "一开始，我就把一屏的代码放进了我的jupyter notebook中，一步一步试水。很明显，我的很多依赖包都没安装，所以也是错误不断。早先是在Windows系统下，使用python2.7，需要什么包，就安装什么包。在安装keras过程中，我发现了Anaconda——很好用的一个科学计算环境，集成了各种数据挖掘包。即使是这样，仍然是满屏的错误，亟待排查。\n",
    "\n",
    "## 步步优化\n",
    "离比赛截止就还只有几天，一边准备期末考试，一边焦急地排查bug。Windows系统下仍有个别难以解决的错误，我索性切换到了做NAO机器人时装的Ubuntu系统下。结合keras给的官方文档，我对原代码进行了函数拆分解耦，又在循环体部分增加了异常检测。综合考虑性能，稍微修改了循环结构。下载好训练的vgg16_weights，在没有错误之后，焦急地等待25分钟后，屏幕开始打印结果。\n",
    "\n",
    "## 欣喜万分\n",
    "第一次提交，随便截取了前面一段，没成绩。折腾了几次，才发现是提交的格式出了问题。后面取p=0.99+部分，提交结果在0.58左右，数据集大概有90个。估计了下，狗狗总数应该在180左右。第二次提交，取了180左右，结果0.97多一点。第三次，也是最后一次提交，取了result前189个，结果0.98639，一举升到第一。\n",
    "\n",
    "---\n",
    "### 比赛总结\n",
    "这次比赛，首先还得感谢yinjh团队的yin前辈。如果没有您分享的代码，就不会有我今天的成绩。感谢您分享的代码，感想您在我写这篇分享时提供的代码指导。\n",
    "再者，感谢我的女票晶晶，谢谢你一直陪在我身边，谢谢你包容我写代码时不那么快的回复手速。我是新手，但我一直不觉得成绩低是理所当。立志从事这一行，就需要快速地学习、快速地成长。新人，也需要做到最好。当然，自己目前还存在很多问题。一些基本的概念只是模糊掌握，需要更多的实践，需要更多的理论积淀，而不是简单地做一个调包侠。\n",
    "\n",
    "### 给新手的建议\n",
    "- 善用搜索引擎，多读官方文档，不要一开始就依赖Google。\n",
    "- Google Groups、Stack Overflow、GitHub是好东西。\n",
    "- 干！就是干！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "** ------------------------------------------------------------------------------------ **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 完整代码\n",
    "- ** 以下操作均在Ubuntu14.04+Anaconda中进行 **\n",
    
    "### 导入python标准包 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os   # 处理字符串路径\n",
    "\n",
    "import glob  # 用于查找文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 导入相关库\n",
    "-  keras\n",
    " - keras是基于Theano的深度学习(Deep Learning)框架 \n",
    "\n",
    " - 详细信息请见[keras官方文档](http://keras.io/) \n",
    " \n",
    "##### 安装过程\n",
    "    \n",
    "   > conda update conda\n",
    "   \n",
    "   > conda update --all\n",
    "   \n",
    "   > conda install mingw libpython\n",
    "    \n",
    "   > pip install git+git://github.com/Theano/Theano.git\n",
    "    \n",
    "   > pip install git+git://github.com/fchollet/keras.git\n",
    "\n",
    "-  cv2 \n",
    " - OpenCV库\n",
    " \n",
    "     > conda isntall opnecv \n",
    "-  numpy\n",
    " - Anaconda自带"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from keras.models import Sequential\n",
    "\n",
    "from keras.layers.core import Flatten, Dense, Dropout\n",
    "\n",
    "from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D\n",
    "\n",
    "from keras.optimizers import SGD\n",
    "\n",
    "import cv2, numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用keras建立vgg16模型\n",
    " - 参考官方示例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def VGG_16(weights_path=None):\n",
    "\n",
    "    model = Sequential()\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1),input_shape=(3,224,224)))\n",
    "\n",
    "    model.add(Convolution2D(64, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(64, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(MaxPooling2D((2,2), strides=(2,2)))\n",
    "\n",
    "\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(128, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(128, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(MaxPooling2D((2,2), strides=(2,2)))\n",
    "\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(256, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(256, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(256, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(MaxPooling2D((2,2), strides=(2,2)))\n",
    "\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(512, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(512, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(512, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(MaxPooling2D((2,2), strides=(2,2)))\n",
    "\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(512, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(512, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(ZeroPadding2D((1,1)))\n",
    "\n",
    "    model.add(Convolution2D(512, 3, 3, activation='relu'))\n",
    "\n",
    "    model.add(MaxPooling2D((2,2), strides=(2,2)))\n",
    "\n",
    "\n",
    "    model.add(Flatten())\n",
    "\n",
    "    model.add(Dense(4096, activation='relu'))\n",
    "\n",
    "    model.add(Dropout(0.5))\n",
    "\n",
    "    model.add(Dense(4096, activation='relu'))\n",
    "\n",
    "    model.add(Dropout(0.5))\n",
    "\n",
    "    model.add(Dense(1000, activation='softmax'))\n",
    "\n",
    "\n",
    "    if weights_path:\n",
    "\n",
    "        model.load_weights(weights_path)\n",
    "\n",
    "\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 引入训练好的vgg16_weights模型\n",
    "** Note： ** \n",
    "- vgg16_weights.h5需单独下载，并与代码文件处于同一文件夹下，否则会报错。\n",
    " - 网上有资源 附百度云盘链接 [vgg16_weights.h5下载](http://pan.baidu.com/s/1qX0CJSC)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = VGG_16('vgg16_weights.h5')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)\n",
    "model.compile(optimizer=sgd, loss='categorical_crossentropy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 猫和狗的特征"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dogs=[251, 268, 256, 253, 255, 254, 257, 159, 211, 210, 212, 214, 213, 216, 215, 219, 220, 221, 217, 218, 207, 209, 206, 205, 208, 193, 202, 194, 191, 204, 187, 203, 185, 192, 183, 199, 195, 181, 184, 201, 186, 200, 182, 188, 189, 190, 197, 196, 198, 179, 180, 177, 178, 175, 163, 174, 176, 160, 162, 161, 164, 168, 173, 170, 169, 165, 166, 167, 172, 171, 264, 263, 266, 265, 267, 262, 246, 242, 243, 248, 247, 229, 233, 234, 228, 231, 232, 230, 227, 226, 235, 225, 224, 223, 222, 236, 252, 237, 250, 249, 241, 239, 238, 240, 244, 245, 259, 261, 260, 258, 154, 153, 158, 152, 155, 151, 157, 156]\n",
    "\n",
    "cats=[281,282,283,284,285,286,287]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 待处理文件导入\n",
    "** Note: **\n",
    "- 将测试集改名为test，放入imgs文件夹下，imgs文件夹又与此代码处于同一文件夹下。\n",
    "- 当然，你也可以修改下面的路径。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "path = os.path.join('imgs', 'test', '*.jpg')  #拼接路径\n",
    " \n",
    "files = glob.glob(path) #返回路径"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 定义几个变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result=[]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "flbase=0\n",
    "p=0\n",
    "temp=0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 定义图像加载函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def load_image(imageurl):\n",
    "    im = cv2.resize(temp ,(224,224)).astype(np.float32)\n",
    "    im[:,:,0] -= 103.939\n",
    "    im[:,:,1] -= 116.779\n",
    "    im[:,:,2] -= 123.68\n",
    "    im = im.transpose((2,0,1))\n",
    "    im = np.expand_dims(im,axis=0)\n",
    "    return im    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 定义预测函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def predict(url):\n",
    "    im = load_image(url)        \n",
    "    out = model.predict(im)\n",
    "    flbase = os.path.basename(url)\n",
    "    p = np.sum(out[0,dogs]) / (np.sum(out[0,dogs]) + np.sum(out[0,cats]))\n",
    "    result.append((flbase,p))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 开始预测\n",
    "** Note： **\n",
    "- 此处的if,else异常检测很重要，因为cv2.imread(fl)在遇到某几张图时会为空，抛出错误，程序中途停止，图片集得不到完全检测。\n",
    "- 一般配置电脑跑这部分时，大约需要20～30分钟，不是程序没有工作，请耐心等待。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for fl in files:\n",
    "    temp=cv2.imread(fl) \n",
    "    if  temp ==None:  \n",
    "        pass\n",
    "    else:\n",
    "        predict(fl)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对结果进行排序"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "result=sorted(result, key=lambda x:x[1], reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 打印预测结果与相应概率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for x in result:\n",
    "    print x[0],x[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 预测结果\n",
    "- 根据上面的概率，选择相应的前多少张图片\n",
    "- 复制进csv文件，使用一般编辑器将\".jpg\"以空格替代"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for x in result:\n",
    "    print x[0]"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
